Investigating Inner Properties of Multimodal Representation and Semantic Compositionality with Brain-based Componential Semantics

Authors

  • Shaonan Wang
  • Jiajun Zhang
  • Nan Lin
  • Chengqing Zong
Abstract

Multimodal models have been proven to outperform text-based approaches on learning semantic representations. However, it remains unclear what properties are encoded in multimodal representations, in what respects they outperform single-modality representations, and what happens during semantic compositionality in different input modalities. Considering that multimodal models are originally motivated by human concept representations, we assume that correlating multimodal representations with brain-based semantics will reveal their inner properties and answer the above questions. To that end, we propose simple interpretation methods based on brain-based componential semantics. First, we investigate the inner properties of multimodal representations by correlating them with the corresponding brain-based property vectors. Then we map the distributed vector space to the interpretable brain-based componential space to explore the inner properties of semantic compositionality. Ultimately, the present paper sheds light on fundamental questions of natural language understanding, such as how to represent the meaning of words and how to combine word meanings into larger units.

Introduction

Multimodal models that learn semantic representations using both linguistic and perceptual inputs are originally motivated by human concept learning and the evidence that many concept representations in the brain are grounded in perception (Andrews, Vigliocco, and Vinson 2009). The perceptual information in such models is derived from images (Roller and Im Walde 2013; Collell, Zhang, and Moens 2017), sounds (Kiela and Clark 2015), or data collected in psychological experiments (Johns and Jones 2012; Hill and Korhonen 2014; Andrews, Vigliocco, and Vinson 2009). Multimodal methods have been proven to outperform text-based approaches on a range of tasks, including modeling the semantic similarity of two words or sentences and finding the images most similar to a word (Bruni, Tran, and Baroni 2014; Lazaridou, Pham, and Baroni 2015; Kurach et al. 2017).

Despite their superiority, what happens inside these models is hard to interpret, and many questions remain unexplored. For example, it is still unclear 1) what properties are encoded in multimodal representations, and in what respects they outperform single-modality representations; and 2) whether different semantic combination rules are encoded in different input modalities, and how different composition models combine the inner properties of semantic representations. Accordingly, to facilitate the development of better multimodal models, it is desirable to efficiently compare and investigate the inner properties of different semantic representations and different composition models.

Experiments with brain imaging tools have accumulated evidence indicating that human concept representations are at least partly embodied in perception, action, and other modal neural systems related to individual experiences (Binder and Desai 2011). Summarizing this previous work, Binder et al. (2016) propose "brain-based componential semantic representations" based entirely on such functional divisions in the human brain, representing concepts by sets of properties such as vision, somatic, audition, spatial, and emotion.
Since multimodal models, to some extent, simulate human concept learning to capture the perceptual information that is encoded in the human brain, we assume that correlating them with brain-based semantics in a proper way will reveal the inner properties of multimodal representations and semantic compositionality. To that end, we first propose a simple correlation method, which utilizes the brain-based componential semantic vectors (Binder et al. 2016) to investigate the inner properties of multimodal word representations. Our method calculates correlations between the relation matrix given by the brain-based property vectors and that given by the multimodal word vectors. The resulting correlation score represents the capability of the multimodal word vectors to capture the brain-based semantic property. Then we employ a mapping method to explore how semantic compositionality works in different input modalities. Specifically, we learn a mapping function from the distributed semantic space to the brain-based componential space. After mapping word and phrase representations to the (interpretable) brain-based semantic space, we compare the transformations of their inner properties in the process of combining word representations into phrases.

Our results show that 1) single-modality vectors from different sources encode complementary semantics in the brain, giving multimodal models the potential to better represent concept meanings; 2) multimodal models improve on text-based models for sensory and motor properties, but degrade the representation quality of abstract properties; 3) different input modalities have similar effects on the inner properties of semantic representations when combining words into phrases, indicating that semantic compositionality is a general process irrespective of input modality; and 4) different composition models combine the inner properties of constituent word representations in different ways, with the Matrix model best simulating semantic compositionality in the multimodal environment.
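The correlation and mapping analyses described above can be made concrete with a short sketch. The code below is a minimal illustration rather than the authors' released implementation: the array names (`embeddings`, `brain_props`) are hypothetical, and the specific choices of cosine similarity for the relation matrices, Spearman rank correlation, and ridge regression for the mapping function are our assumptions where the text leaves the details open.

```python
# Minimal sketch of the two analyses, assuming `embeddings` is an
# (n_words, d) array of multimodal word vectors and `brain_props` maps
# each brain-based property name to an (n_words, n_attributes) array of
# saliency ratings, with rows aligned to the same word list.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge

def property_correlation(embeddings, prop_vectors):
    """Correlate the pairwise word relations induced by the embeddings
    with those induced by one brain-based property (an RSA-style score)."""
    emb_sims = 1.0 - pdist(embeddings, metric="cosine")    # condensed similarity vector
    prop_sims = 1.0 - pdist(prop_vectors, metric="cosine")
    rho, _ = spearmanr(emb_sims, prop_sims)                # rank correlation of relations
    return rho

def fit_mapping(embeddings, brain_matrix, alpha=1.0):
    """Learn a linear map from the distributed vector space to the
    interpretable brain-based componential space (ridge regression)."""
    return Ridge(alpha=alpha).fit(embeddings, brain_matrix)

# Usage: score every property, then project a phrase vector into brain space.
# scores = {name: property_correlation(embeddings, vecs)
#           for name, vecs in brain_props.items()}
# mapper = fit_mapping(embeddings, np.hstack(list(brain_props.values())))
# phrase_in_brain_space = mapper.predict(phrase_vec.reshape(1, -1))
```

A higher correlation score for a property means the vector space preserves more of the word-to-word relational structure that the property induces, which is what allows per-property comparisons between unimodal and multimodal representations.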
Related Work

Investigation of word representations

There has been some research on interpreting word representations. Most work investigates the inner properties of semantic representations by correlating them with linguistic features (Ling and Dyer 2015; Yogatama and Smith 2015; Qiu and Huang 2016). In addition, Rubinstein et al. (2015) and Collell and Moens (2016) evaluate the capabilities of linguistic and visual representations, respectively, by predicting word features. They utilize the McRae Feature Norms dataset (McRae et al. 2005), which contains 541 words with a total of 2,526 features such as "an animal", "clothing", and "is fast". This work can be seen as foreshadowing our experimental paradigm of correlating dense vectors with a sparse feature space. Different from the above work, we utilize the brain-based semantic representations. This dataset contains basic semantic units directly linked to the human brain, and is thus more complete and more cognitively plausible for representing concept meaning. Furthermore, it is worth noting that none of this work focuses on multimodal representations, and it lacks a direct comparison between unimodal and multimodal representations. This is exactly our novelty and contribution.

Investigation of semantic compositionality

Semantic compositionality has been explored with different types of composition models (Mitchell and Lapata 2010; Dinu et al. 2013; Wang and Zong 2017; Wang, Zhang, and Zong 2017a; Wang, Zhang, and Zong 2017b; Wang, Zhang, and Zong 2018). Still, the dimensions of many semantic vector spaces have no clear meaning, and it is thus difficult to interpret how different composition models work. Fyshe et al. (2015) tackle this problem by utilizing sparse vector spaces. They use the intruder task to quantify the interpretability of semantic dimensions, which requires manual labeling, and the results are not intuitive. Li et al. (2015) use visualization methods, projecting words, phrases, and sentences into a two-dimensional space. This method shows the semantic distance between words, phrases, and sentences, but cannot explain what happens inside composition.

Semantic compositionality in computer vision has not received as much attention as in the natural language area. To the best of our knowledge, the following two studies are most relevant to our work. Nguyen et al. (2014) model the compositionality of attributes and objects in the visual modality, as done for adjective-noun composition in the linguistic modality. Their results show that the concept topologies and semantic compositionality in the two modalities share similarities. Pezzelle et al. (2016) investigate the problem of noun-noun composition in vision. They find that a simple Addition model is effective in achieving visual compositionality. This paper takes a step further and provides a direct and comprehensive investigation of the composition process in both linguistic and visual modalities. Furthermore, we conduct pioneering work on multimodal semantic compositionality, in which multimodal word representations are combined to obtain phrase representations. Taken together, our work offers some insights into the behavior of semantic compositionality.

Human concept representations and composition

Classical componential theories of lexical semantics assume that concepts can be represented by sets of primitive features, which is problematic in that these features are themselves complex concepts. Binder et al. (2016) tackle this problem by resorting to brain imaging studies. They propose "brain-based componential semantics" based entirely on functional divisions in the human brain, and represent concepts by sets of properties like vision, somatic, audition, spatial, and emotion. The brain-based semantic representations are highly correlated with brain imaging data, and have been used as an intermediate semantic representation in exploring human semantics (Anderson et al. 2016).

There is previous work exploring the question of semantic composition in the human brain (Chang 2011; Fyshe 2015). To infer how semantic composition works in the brain, these studies conduct brain imaging experiments with participants viewing words and phrases, and analyze the data by fitting vector-based composition models. The results illustrate that the Multiplication model outperforms the Addition model on adjective-noun phrase composition, indicating that people use adjectives to modify the meaning of nouns. Unlike this work, the present paper aims to interpret the inner properties of different composition models in achieving compositionality. We hope that the proposed method can feed back into neuroscience to help explore human concept representations and composition.
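For reference, the composition models compared in this line of work can be written compactly. Below is an illustrative sketch of the Addition and Multiplication models of Mitchell and Lapata (2010); the Matrix model is rendered as a learned linear map over the concatenated word vectors, which is one common formulation and may differ from the exact parameterization used in the paper.

```python
import numpy as np

def compose_addition(u, v):
    # Addition: the phrase inherits the sum of the word features.
    return u + v

def compose_multiplication(u, v):
    # Multiplication: element-wise product, so each word gates the
    # other's features (the model favored for adjective-noun phrases
    # in the brain imaging studies cited above).
    return u * v

def compose_matrix(u, v, W):
    # Matrix: p = W [u; v], with W of shape (d, 2d) typically learned
    # from observed phrase vectors, e.g. by least squares.
    return W @ np.concatenate([u, v])

# Toy example with 3-dimensional vectors and a random (untrained) W.
u = np.array([1.0, 0.5, 0.0])
v = np.array([0.2, 1.0, 0.3])
W = np.random.default_rng(0).normal(size=(3, 6))
print(compose_addition(u, v), compose_multiplication(u, v), compose_matrix(u, v, W))
```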
Brain-based Componential Semantic Representations

The brain-based componential semantic dataset was proposed by Binder et al. (2016) and contains 535 different types of concepts.¹ Each concept has 14 properties, i.e., vision, somatic, audition, gustation, olfaction, motor, spatial, temporal, causal, social, cognition, emotion, drive, and attention, and each property contains several attributes (1~15). For instance, the vision property is described with attributes such as bright, dark, color, pattern, large, and small. Through crowd-sourced rating experiments, each attribute of all 535 concepts is assessed with a saliency score (0~6).

[Figure 1 omitted]

¹These are 122 abstract words and 413 concrete words, including nouns, verbs, and adjectives. The dataset can be found at: http://www.neuro.mcw.edu/resources.html
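To make the dataset's structure concrete, the sketch below collapses attribute-level saliency ratings (0~6) into one score per property, yielding a 14-dimensional property vector per concept. The attribute values shown are invented for illustration, and aggregation by averaging is our assumption; analyses can equally operate on the full attribute vectors.

```python
import numpy as np

# Hypothetical excerpt of crowd-sourced ratings for one concept,
# grouped by brain-based property (values invented for illustration).
concept_ratings = {
    "vision":   {"bright": 4.2, "dark": 1.1, "color": 3.8, "pattern": 2.0},
    "audition": {"sound": 0.5, "loud": 0.2},
    "emotion":  {"happy": 3.1, "sad": 0.4},
    # ... the remaining properties would be listed here as well
}

PROPERTY_ORDER = ["vision", "somatic", "audition", "gustation", "olfaction",
                  "motor", "spatial", "temporal", "causal", "social",
                  "cognition", "emotion", "drive", "attention"]

def property_vector(ratings):
    """Average each property's attribute saliencies into one score,
    returning a vector aligned with the 14 brain-based properties."""
    return np.array([np.mean(list(ratings[p].values())) if p in ratings else 0.0
                     for p in PROPERTY_ORDER])

vec = property_vector(concept_ratings)   # shape (14,)
```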

Journal title:
  • CoRR

Volume: abs/1711.05516
Pages: -
Publication year: 2017